Background & Context
Thera bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the various fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would lead the bank to a loss, so the bank wants to analyze its customer data, identify the customers who will leave the service, and understand the reasons why, so that it can improve in those areas.
As a data scientist at Thera bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
Objective
Data Dictionary:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Libraries to tune model, get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
# Libraries to help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
BaggingClassifier,
RandomForestClassifier)
from xgboost import XGBClassifier
# Library for statistics
import scipy.stats as stats
data = pd.read_csv("BankChurners.csv")
data
data.info()
for feature in data.columns:
    if data[feature].dtype == 'object':
        data[feature] = data[feature].astype('category')
data.info()
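As a side note, converting low-cardinality object columns to the `category` dtype also saves memory, since pandas stores integer codes plus one copy of each label. A minimal sketch on a synthetic column (the `Card_Category` name is reused here purely for illustration):

```python
import pandas as pd

# Synthetic stand-in for one of the notebook's low-cardinality columns
df = pd.DataFrame({"Card_Category": ["Blue", "Silver", "Blue", "Gold"] * 1000})

before = df["Card_Category"].memory_usage(deep=True)  # object dtype: one string per row
df["Card_Category"] = df["Card_Category"].astype("category")
after = df["Card_Category"].memory_usage(deep=True)   # category dtype: small integer codes

print(f"object: {before} bytes, category: {after} bytes")
```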
data.nunique()
Dropping CLIENTNUM, as it is unique for each customer and will not add value to the model.
# Dropping column CLIENTNUM
data.drop("CLIENTNUM", axis=1, inplace=True)
data.describe().T
- Customer_Age: The customer's age ranges from 26 to 73 years, with an average/median age of 46 years.
- Dependent_count: The number of dependents ranges from 0 to 5, with an average of 2.3 dependents per customer.
- Months_on_book: Customers have been with the bank for at least 13 months, and some for up to 56 months. The average period with the bank is 35.9 months.
- Total_Relationship_Count: The total number of products held by a customer ranges from 1 to 6, with an average of 3.8.
- Months_Inactive_12_mon: The number of months inactive in the last 12 months ranges from 0 to 6, with an average of 2.3 months of inactivity.
- Contacts_Count_12_mon: The number of contacts in the last 12 months ranges from 0 to 6, with an average of 2.5.
- Credit_Limit: The credit limit on the credit card ranges from 1,438 to 34,516, with an average of 8,632. This reflects the differences in purchasing power among customers.
- Total_Revolving_Bal: The total revolving balance on the credit card ranges from 0 to 2,517, with an average of 1,163.
- Avg_Open_To_Buy: The open-to-buy credit line (average of the last 12 months) ranges from 3 to 34,516, with an average of 7,469.
- Total_Amt_Chng_Q4_Q1: The change in transaction amount (Q4/Q1 ratio) ranges from 0 to 3.4, with an average of 0.76 (i.e. < 1). This reflects a drop in transactions between Q1 and Q4, with most customers purchasing less with their credit card.
- Total_Trans_Amt: The total transaction amount over the last 12 months ranges from 510 to 18,484, with an average of 4,404. This further reflects the differences in purchasing power among customers.
- Total_Trans_Ct: The total transaction count over the last 12 months ranges from 10 to 139, with an average of 64.9 transactions per customer within a year.
- Total_Ct_Chng_Q4_Q1: The change in transaction count (Q4/Q1 ratio) ranges from 0 to 3.7, with an average of 0.71 (i.e. < 1). This further reflects a drop in transaction activity between Q1 and Q4, with most customers making fewer transactions.
- Avg_Utilization_Ratio: The average card utilization ratio ranges from 0 to 1, with an average of 0.27, indicating that most of the time customers prefer alternative methods of payment over their credit card.
for feature in data.columns:
    if data[feature].dtype not in ['int64', 'float64']:
        print(data[feature].value_counts())
        print(40 * '-')
- Attrition_Flag: The data set is imbalanced, as there are more existing customers than attrited customers. This will need to be taken into account when predicting attrited customers.
- Gender: There are slightly more female customers than male customers.
- Education_Level: The education level of customers is split into seven categories: unknown, uneducated, high school, college, graduate, post-graduate, doctorate.
- Marital_Status: Marital status is split into four categories: married, single, unknown, divorced. Most customers are either married or single.
- Income_Category: Income category is split into six categories: unknown, less than 40K, 40K - 60K, 60K - 80K, 80K - 120K, 120K+, with most customers earning less than 40K USD.
- Card_Category: Card category is split into four categories: blue, silver, gold, platinum. Most customers own a blue card.
# While doing uni-variate analysis of numerical variables we want to study their central tendency
# and dispersion.
# Let us write a function that will help us create boxplot and histogram for any input numerical
# variable.
# This function takes the numerical column as the input and returns the boxplots
# and histograms for the variable.
# Let us see if this help us write faster and cleaner code.
def histogram_boxplot(feature, figsize=(15,7), bins=None):
    """ Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (15,7))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows=2,  # number of rows of the subplot grid = 2
                                           sharex=True,  # x-axis will be shared among all subplots
                                           gridspec_kw={"height_ratios": (.25, .75)},
                                           figsize=figsize
                                           )  # creating the 2 subplots
    sns.boxplot(x=feature, data=data, ax=ax_box2, showmeans=True, color='violet')  # boxplot; a star indicates the mean value of the column
    if bins:
        sns.histplot(x=feature, data=data, kde=False, ax=ax_hist2, bins=bins)  # histogram with the requested number of bins
    else:
        sns.histplot(x=feature, data=data, kde=False, ax=ax_hist2)  # histogram with automatic binning
    ax_hist2.axvline(data[feature].mean(), color='green', linestyle='--')  # add mean to the histogram
    ax_hist2.axvline(data[feature].median(), color='black', linestyle='-')  # add median to the histogram
# Observations on Customer_age
histogram_boxplot("Customer_Age")
# Observations on dependent count
histogram_boxplot("Dependent_count")
# Observations on months on book
histogram_boxplot("Months_on_book")
The distribution of months on book seems approximately normal, apart from a few irregular points. These seem to reflect prior treatment of missing values (imputed with the median) and of outliers (extreme values capped at a minimum or maximum).
# Observations on total relationship counts
histogram_boxplot("Total_Relationship_Count")
# Observations on months inactivity over last 12 months
histogram_boxplot("Months_Inactive_12_mon")
# Observations on contacts count
histogram_boxplot("Contacts_Count_12_mon")
# Observations on credit limit
histogram_boxplot("Credit_Limit")
# Observations on total revolving balance
histogram_boxplot("Total_Revolving_Bal")
# Observations on open to buy credit line
histogram_boxplot("Avg_Open_To_Buy")
# Observations on change in transaction amount
histogram_boxplot("Total_Amt_Chng_Q4_Q1")
print("Percentage of customers purchasing less in Q4 than in Q1: {:.1f}%".format(len(data[data["Total_Amt_Chng_Q4_Q1"] < 1]) / len(data) * 100))
# Observations on total transaction amount
histogram_boxplot("Total_Trans_Amt")
# Observations on total transaction count
histogram_boxplot("Total_Trans_Ct")
# Observations on change in transaction counts
histogram_boxplot("Total_Ct_Chng_Q4_Q1")
print("Percentage of customers making fewer transactions in Q4 than in Q1: {:.1f}%".format(len(data[data["Total_Ct_Chng_Q4_Q1"] < 1]) / len(data) * 100))
# Observations on average utilization ratio
histogram_boxplot("Avg_Utilization_Ratio")
print("Percentage of customers not using their credit card: {:.1f}%".format(len(data[data["Avg_Utilization_Ratio"] == 0]) / len(data) * 100))
def perc_on_bar(feature, figsize=(10,5)):
    '''
    plot the percentage of each class on top of its bar
    feature: categorical feature
    the function won't work if a column is passed in the hue parameter
    '''
    total = len(data[feature])  # length of the column
    plt.figure(figsize=figsize)
    ax = sns.countplot(x=feature, data=data)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - .05  # x position of the annotation
        y = p.get_height() + 30  # y position of the annotation
        ax.annotate(percentage, (x, y), size=12)  # annotate the percentage
    plt.show()
# Observations on attrition
perc_on_bar('Attrition_Flag')
# Observations on gender
perc_on_bar('Gender')
# Observations on education level
perc_on_bar('Education_Level')
# Observations on marital status
perc_on_bar('Marital_Status')
# Observations on income category
perc_on_bar('Income_Category')
# Observations on card category
perc_on_bar('Card_Category')
sns.pairplot(data, hue="Attrition_Flag");
data.groupby('Attrition_Flag').mean(numeric_only=True)
cols = ['Total_Relationship_Count', 'Months_Inactive_12_mon', 'Total_Revolving_Bal',
'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio', 'Contacts_Count_12_mon']
plt.figure(figsize=(17,30))
for i, variable in enumerate(cols):
    plt.subplot(5, 3, i + 1)
    sns.boxplot(x='Attrition_Flag', y=variable, data=data, showfliers=False)
    plt.title(variable)
plt.show()
Compared to existing customers, attrited customers tend to:
### Function to plot stacked bar charts for categorical columns
def stacked_plot(x, figsize=(10, 5)):
    tab1 = pd.crosstab(x, data["Attrition_Flag"], margins=True)
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(x, data["Attrition_Flag"], normalize="index")
    tab.plot(kind="bar", stacked=True, figsize=figsize)
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # place the legend outside the plot
    plt.show()
stacked_plot(data["Gender"])
# Converting Attrition_flag to numerical variable for t-test
attrition = {'Existing Customer':0, 'Attrited Customer':1}
data['Attrition_Flag']=data['Attrition_Flag'].map(attrition)
data['Attrition_Flag']=data['Attrition_Flag'].astype('int')
# T-test to check dependency of attrition on gender
Ho = "Gender has no effect on attrition rate" # Stating the Null Hypothesis
Ha = "Gender has an effect on attrition rate" # Stating the Alternate Hypothesis
x = np.array(data[data['Gender'] == 'M'].Attrition_Flag) # Selecting attrition values corresponding to males as an array
y = np.array(data[data['Gender'] == 'F'].Attrition_Flag) # Selecting attrition values corresponding to females as an array
t, p_value = stats.ttest_ind(x,y, axis = 0) #Performing an Independent t-test
if p_value < 0.05:  # Setting our significance level at 5%
    print(f'{Ha} as the p_value: {p_value.round(4)} < 0.05')
else:
    print(f'{Ho} as the p_value: {p_value.round(4)} > 0.05')
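Since gender and attrition are both categorical, a chi-square test of independence on the contingency table is the more conventional check; the t-test above works here because attrition has been coded 0/1. A hedged sketch on synthetic stand-in data (the 20% vs. 12% attrition rates are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-in for the notebook's columns: attrition rate differs by gender
gender = rng.choice(["M", "F"], size=2000)
attrition = (rng.random(2000) < np.where(gender == "M", 0.20, 0.12)).astype(int)

table = pd.crosstab(gender, attrition)  # 2x2 contingency table
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}, dof = {dof}")
```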
stacked_plot(data["Education_Level"])
stacked_plot(data["Marital_Status"])
# Test to see if attrition rate for customers having different marital status is significantly different
Ho = "Marital status has no effect on attrition rate" # Stating the Null Hypothesis
Ha = "Marital status has an effect on attrition rate" # Stating the Alternate Hypothesis
divorced = data[data['Marital_Status'] == 'Divorced']['Attrition_Flag']
married = data[data['Marital_Status'] == 'Married']['Attrition_Flag']
single = data[data['Marital_Status'] == 'Single']['Attrition_Flag']
unknown = data[data['Marital_Status'] == 'Unknown']['Attrition_Flag']
f_stat, p_value = stats.f_oneway(divorced,married,single,unknown)
if p_value < 0.05:  # Setting our significance level at 5%
    print(f'{Ha} as the p_value: {p_value.round(3)} < 0.05')
else:
    print(f'{Ho} as the p_value: {p_value.round(3)} > 0.05')
stacked_plot(data["Income_Category"])
# Test to see if attrition rate for customers having different income category is significantly different
Ho = "Income category has no effect on attrition rate" # Stating the Null Hypothesis
Ha = "Income category has an effect on attrition rate" # Stating the Alternate Hypothesis
IC_40 = data[data['Income_Category'] == 'Less than $40K']['Attrition_Flag']
IC_40_60 = data[data['Income_Category'] == '$40K - $60K']['Attrition_Flag']
IC_60_80 = data[data['Income_Category'] == '$60K - $80K']['Attrition_Flag']
IC_80_120 = data[data['Income_Category'] == '$80K - $120K']['Attrition_Flag']
IC_120 = data[data['Income_Category'] == '$120K +']['Attrition_Flag']
IC_unknown = data[data['Income_Category'] == 'Unknown']['Attrition_Flag']
f_stat, p_value = stats.f_oneway(IC_40,IC_40_60,IC_60_80,IC_80_120,IC_120,IC_unknown)
if p_value < 0.05:  # Setting our significance level at 5%
    print(f'{Ha} as the p_value: {p_value.round(3)} < 0.05')
else:
    print(f'{Ho} as the p_value: {p_value.round(3)} > 0.05')
stacked_plot(data["Card_Category"])
# Test to see if attrition rate for customers having different type of credit card is significantly different
Ho = "Card category has no effect on attrition rate" # Stating the Null Hypothesis
Ha = "Card category has an effect on attrition rate" # Stating the Alternate Hypothesis
blue = data[data['Card_Category'] == 'Blue']['Attrition_Flag']
silver = data[data['Card_Category'] == 'Silver']['Attrition_Flag']
gold = data[data['Card_Category'] == 'Gold']['Attrition_Flag']
platinum = data[data['Card_Category'] == 'Platinum']['Attrition_Flag']
f_stat, p_value = stats.f_oneway(blue,silver,gold,platinum)
if p_value < 0.05:  # Setting our significance level at 5%
    print(f'{Ha} as the p_value: {p_value.round(3)} < 0.05')
else:
    print(f'{Ho} as the p_value: {p_value.round(3)} > 0.05')
plt.figure(figsize=(12,7))
sns.heatmap(data.corr(numeric_only=True), annot=True, fmt='.2f', vmin=-1);
sns.lmplot(data=data, y='Avg_Open_To_Buy', x='Credit_Limit', hue='Attrition_Flag');
sns.lmplot(data=data, y='Total_Trans_Amt', x='Total_Trans_Ct', hue='Attrition_Flag');
sns.lmplot(data=data, y='Months_on_book', x='Customer_Age', hue='Attrition_Flag');
sns.lmplot(data=data, y='Total_Revolving_Bal', x='Avg_Utilization_Ratio', hue='Attrition_Flag');
data.drop(columns=['Months_on_book', 'Avg_Open_To_Buy', 'Card_Category', 'Marital_Status'], inplace=True)
# Separating target variable and other variables
X = data.drop('Attrition_Flag', axis=1)
y = data["Attrition_Flag"]
X=pd.get_dummies(X, drop_first=True)
X.shape
After encoding, there are 24 predictor variables.
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
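The `stratify=y` argument keeps the class ratio identical in both splits, which matters for an imbalanced target like attrition. A minimal check on synthetic labels (not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 80 + [1] * 20)  # imbalanced labels: 20% positives
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y
)
# Stratification preserves the 80/20 ratio in both splits
print(y_tr.mean(), y_te.mean())
```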
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model, train, test, train_y, test_y, flag=True):
    '''
    model : classifier to predict values of X
    '''
    # defining an empty list to store train and test results
    score_list = []
    pred_train = model.predict(train)
    pred_test = model.predict(test)
    train_acc = model.score(train, train_y)
    test_acc = model.score(test, test_y)
    train_recall = metrics.recall_score(train_y, pred_train)
    test_recall = metrics.recall_score(test_y, pred_test)
    train_precision = metrics.precision_score(train_y, pred_train)
    test_precision = metrics.precision_score(test_y, pred_test)
    score_list.extend((train_acc, test_acc, train_recall, test_recall, train_precision, test_precision))
    # The following print statements are displayed only if the flag is set to True (the default)
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
    return score_list  # returning the list with train and test scores
def make_confusion_matrix(model, y_actual, labels=[0, 1]):
    '''
    model : classifier to predict values of X_test
    y_actual : ground truth
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=['Predicted - No', 'Predicted - Yes'])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annot_labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annot_labels = np.asarray(annot_labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=annot_labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Logistic regression and standard scaling embedded in pipeline
lr = Pipeline(steps=[
("scaler", StandardScaler()),
("log_reg", LogisticRegression(random_state=1))
])
lr.fit(X_train, y_train)
#Using k-fold cross validation
scoring='recall'
kfold=StratifiedKFold(n_splits=20,shuffle=True,random_state=1) #Setting number of splits equal to 20
cv_result_lr=cross_val_score(estimator=lr, X=X_train, y=y_train, scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_lr)
plt.show()
print("Average recall on validation set: {:.1f}%".format(cv_result_lr.mean() * 100))
#Calculating different metrics
scores_LR = get_metrics_score(lr,X_train,X_test,y_train,y_test)
# creating confusion matrix
make_confusion_matrix(lr,y_test)
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state = 1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before Under Sampling, counts of label '0': {} \n".format(sum(y_train==0)))
print("After Under Sampling, counts of label '1': {}".format(sum(y_train_un==1)))
print("After Under Sampling, counts of label '0': {} \n".format(sum(y_train_un==0)))
print('After Under Sampling, the shape of train_X: {}'.format(X_train_un.shape))
print('After Under Sampling, the shape of train_y: {} \n'.format(y_train_un.shape))
lr_under = Pipeline(steps=[
("scaler", StandardScaler()),
("log_reg", LogisticRegression(random_state=1))
])
lr_under.fit(X_train_un, y_train_un)
#Using k-fold cross validation
scoring='recall'
kfold=StratifiedKFold(n_splits=20,shuffle=True,random_state=1) #Setting number of splits equal to 20
cv_result_lr_under=cross_val_score(estimator=lr_under, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_lr_under)
plt.show()
print("Average recall on validation set: {:.1f}%".format(cv_result_lr_under.mean() * 100))
#Calculating different metrics
scores_LR_under = get_metrics_score(lr_under,X_train_un,X_test,y_train_un,y_test)
# creating confusion matrix
make_confusion_matrix(lr_under,y_test)
from imblearn.over_sampling import SMOTE
print("Before UpSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before UpSampling, counts of label '0': {} \n".format(sum(y_train==0)))
sm = SMOTE(sampling_strategy = 1 ,k_neighbors = 5, random_state=1) #Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After UpSampling, counts of label '1': {}".format(sum(y_train_over==1)))
print("After UpSampling, counts of label '0': {} \n".format(sum(y_train_over==0)))
print('After UpSampling, the shape of train_X: {}'.format(X_train_over.shape))
print('After UpSampling, the shape of train_y: {} \n'.format(y_train_over.shape))
lr_over = Pipeline(steps=[
("scaler", StandardScaler()),
("log_reg", LogisticRegression(random_state=1))
])
lr_over.fit(X_train_over, y_train_over)
#Using k-fold cross validation
scoring='recall'
kfold=StratifiedKFold(n_splits=20,shuffle=True,random_state=1) #Setting number of splits equal to 20
cv_result_lr_over=cross_val_score(estimator=lr_over, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_lr_over)
plt.show()
print("Average recall on validation set: {:.1f}%".format(cv_result_lr_over.mean() * 100))
#Calculating different metrics
scores_LR_over = get_metrics_score(lr_over,X_train_over,X_test,y_train_over,y_test)
# creating confusion matrix
make_confusion_matrix(lr_over,y_test)
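An alternative to under- and over-sampling worth noting: `class_weight='balanced'` reweights the loss during training without altering the data at all. A sketch on synthetic imbalanced data (not the notebook's dataset); the weighted model typically trades some precision for higher minority-class recall:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem (~15% positives)
X, y = make_classification(n_samples=2000, weights=[0.85, 0.15], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

plain = LogisticRegression(random_state=1).fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight='balanced', random_state=1).fit(X_tr, y_tr)

r_plain = recall_score(y_te, plain.predict(X_te))
r_weighted = recall_score(y_te, weighted.predict(X_te))
print("plain recall   :", r_plain)
print("weighted recall:", r_weighted)
```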
%%time
# Choose the type of classifier.
lr_reg = Pipeline(steps=[
("scaler", StandardScaler()),
("log_reg", LogisticRegression(random_state=1,solver='saga'))
])
# Grid of parameters to choose from
parameters = {'log_reg__C': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=lr_reg, param_grid=parameters, scoring=scorer, cv=10)
# Fitting parameters in GridSearchCV
grid_cv.fit(X_train_over, y_train_over)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
# Set the clf to the best combination of parameters
lr_reg = grid_cv.best_estimator_
# Fit the best algorithm to the data.
lr_reg.fit(X_train_over, y_train_over)
#Calculating different metrics
get_metrics_score(lr_reg,X_train_over,X_test,y_train_over,y_test)
# creating confusion matrix
make_confusion_matrix(lr_reg,y_test)
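Because the features are standardized inside the pipeline, the fitted logistic regression coefficients are directly comparable and indicate each feature's direction of influence on attrition. A sketch of extracting them from a pipeline shaped like the one above, using synthetic data and illustrative feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix
X, y = make_classification(n_samples=500, n_features=5, random_state=1)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("log_reg", LogisticRegression(random_state=1)),
])
pipe.fit(X, y)

# Coefficients on standardized features, sorted by magnitude of effect
coefs = pd.Series(pipe.named_steps["log_reg"].coef_[0], index=X.columns).sort_values()
print(coefs)
```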
models = [] # Empty list to store all the models
# Appending pipelines for each model into the list
models.append(
(
"DTREE",
Pipeline(
steps=[
("scaler", StandardScaler()),
("decision_tree", DecisionTreeClassifier(class_weight='balanced', random_state=1)),
]
),
)
)
models.append(
(
"RF",
Pipeline(
steps=[
("scaler", StandardScaler()),
("random_forest", RandomForestClassifier(class_weight='balanced', random_state=1)),
]
),
)
)
models.append(
(
"BAG_DTREE",
Pipeline(
steps=[
("scaler", StandardScaler()),
("bagging_decision_tree", BaggingClassifier(DecisionTreeClassifier(class_weight='balanced', random_state=1),
random_state=1))
]
),
)
)
models.append(
(
"BAG_LR",
Pipeline(
steps=[
("scaler", StandardScaler()),
("bagging_logistic_reg", BaggingClassifier(LogisticRegression(class_weight='balanced', max_iter=1000),
random_state=1))
]
),
)
)
models.append(
(
"GBM",
Pipeline(
steps=[
("scaler", StandardScaler()),
("gradient_boosting", GradientBoostingClassifier(random_state=1)),
]
),
)
)
models.append(
(
"ADB",
Pipeline(
steps=[
("scaler", StandardScaler()),
("adaboost", AdaBoostClassifier(random_state=1)),
]
),
)
)
models.append(
(
"XGB",
Pipeline(
steps=[
("scaler", StandardScaler()),
("xgboost", XGBClassifier(random_state=1,eval_metric='logloss')),
]
),
)
)
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=40, shuffle=True, random_state=1
    )  # Setting number of splits equal to 40
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1, class_weight='balanced'))
# Parameter grid to pass in GridSearchCV
param_grid = {'decisiontreeclassifier__max_depth': [3, 5, 7, 10, 15],
'decisiontreeclassifier__max_leaf_nodes': [8, 32, 64, 128, 256],
'decisiontreeclassifier__min_samples_leaf': [1, 2, 4, 6, 8, 10],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)
# Fitting parameters in GridSearchCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
# Creating new pipeline with best parameters
dtree_tuned1 = make_pipeline(
StandardScaler(),
DecisionTreeClassifier(
class_weight='balanced',
max_depth=7,
max_leaf_nodes=32,
min_samples_leaf=8,
random_state=1,
),
)
# Fit the model on training data
dtree_tuned1.fit(X_train, y_train)
# Calculating different metrics
get_metrics_score(dtree_tuned1,X_train,X_test,y_train,y_test)
# creating confusion matrix
make_confusion_matrix(dtree_tuned1,y_test)
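One advantage of a shallow tuned tree is that its splits can be drawn or printed as plain decision rules. A sketch on synthetic data (the feature names `f0`..`f3` are placeholders, not dataset columns):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs outside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

X, y = make_classification(n_samples=300, n_features=4, random_state=1)
names = [f"f{i}" for i in range(4)]

tree = DecisionTreeClassifier(max_depth=3, class_weight='balanced', random_state=1).fit(X, y)

plt.figure(figsize=(12, 6))
plot_tree(tree, filled=True, feature_names=names)  # graphical view of the splits
plt.savefig("tree.png")

print(export_text(tree, feature_names=names))  # same splits as text rules
```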
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1, class_weight='balanced'))
# Parameter grid to pass in GridSearchCV
param_grid = {'decisiontreeclassifier__max_depth': [3, 5, 7, 10, 15],
'decisiontreeclassifier__max_leaf_nodes': [8, 32, 64, 128, 256],
'decisiontreeclassifier__min_samples_leaf': [1, 2, 4, 6, 8, 10],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
dtree_tuned2 = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=20, scoring=scorer, cv=5, random_state=1)
# Fitting parameters
dtree_tuned2.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(dtree_tuned2.best_params_, dtree_tuned2.best_score_)
)
%%time
#Creating pipeline
pipe=make_pipeline(StandardScaler(), XGBClassifier(random_state=1,eval_metric='logloss'))
#Parameter grid to pass in GridSearchCV
param_grid={'xgbclassifier__n_estimators':np.arange(50,250,50),'xgbclassifier__scale_pos_weight':[1,5,10],
'xgbclassifier__learning_rate':[0.01,0.05,0.1], 'xgbclassifier__gamma':[0,1,5],
'xgbclassifier__subsample':[0.8,0.9,1]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)
# Fitting parameters in GridSearchCV
grid_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score = {}".format(grid_cv.best_params_, grid_cv.best_score_))
# Creating new pipeline with best parameters
xgb_tuned1 = make_pipeline(
StandardScaler(),
XGBClassifier(
random_state=1,
n_estimators=50,
scale_pos_weight=10,
subsample=0.8,
learning_rate=0.05,
gamma=5,
eval_metric='logloss',
),
)
# Fit the model on training data
xgb_tuned1.fit(X_train, y_train)
# Calculating different metrics
get_metrics_score(xgb_tuned1,X_train,X_test,y_train,y_test)
# creating confusion matrix
make_confusion_matrix(xgb_tuned1,y_test)
%%time
# Creating pipeline
pipe=make_pipeline(StandardScaler(), XGBClassifier(random_state=1,eval_metric='logloss'))
#Parameter grid to pass in GridSearchCV
param_grid={'xgbclassifier__n_estimators':np.arange(50,250,50),'xgbclassifier__scale_pos_weight':[1,5,10],
'xgbclassifier__learning_rate':[0.01,0.05,0.1], 'xgbclassifier__gamma':[0,1,5],
'xgbclassifier__subsample':[0.8,0.9,1]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
xgb_tuned2 = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=20, scoring=scorer, cv=5, random_state=1)
# Fitting parameters
xgb_tuned2.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(xgb_tuned2.best_params_, xgb_tuned2.best_score_)
)
# Creating new pipeline with best parameters
xgb_tuned2 = Pipeline(
steps=[
("scaler", StandardScaler()),
(
"XGB",
XGBClassifier(
random_state=1,
n_estimators=100,
scale_pos_weight=10,
gamma=5,
subsample=0.9,
learning_rate= 0.01,
eval_metric='logloss'
),
),
]
)
# Fit the model on training data
xgb_tuned2.fit(X_train, y_train)
# Calculating different metrics
get_metrics_score(xgb_tuned2,X_train,X_test,y_train,y_test)
# creating confusion matrix
make_confusion_matrix(xgb_tuned2,y_test)
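Besides class weighting and resampling, recall can also be raised after training by lowering the default 0.5 decision threshold on predicted probabilities, at the cost of precision. A sketch on synthetic data (the 0.3 cut-off is an arbitrary illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.84, 0.16], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)
clf = LogisticRegression(random_state=1).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]       # predicted probability of the positive class
default_pred = (proba >= 0.5).astype(int)   # sklearn's default cut-off
lowered_pred = (proba >= 0.3).astype(int)   # lower threshold flags more positives

r_default = recall_score(y_te, default_pred)
r_lowered = recall_score(y_te, lowered_pred)
# Lowering the threshold can only keep or raise recall (it predicts a superset of positives)
print("recall at 0.5:", r_default, " recall at 0.3:", r_lowered)
```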
# defining list of model
models = [lr]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
    j = get_metrics_score(model, X_train, X_test, y_train, y_test, False)
    acc_train.append(j[0])
    acc_test.append(j[1])
    recall_train.append(j[2])
    recall_test.append(j[3])
    precision_train.append(j[4])
    precision_test.append(j[5])
# defining list of model
models = [lr_under]
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
    j = get_metrics_score(model, X_train_un, X_test, y_train_un, y_test, False)
    acc_train.append(j[0])
    acc_test.append(j[1])
    recall_train.append(j[2])
    recall_test.append(j[3])
    precision_train.append(j[4])
    precision_test.append(j[5])
# defining list of models
models = [lr_over, lr_reg]
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
    j = get_metrics_score(model, X_train_over, X_test, y_train_over, y_test, False)
    acc_train.append(j[0])
    acc_test.append(j[1])
    recall_train.append(j[2])
    recall_test.append(j[3])
    precision_train.append(j[4])
    precision_test.append(j[5])
# defining list of model
models = [dtree_tuned1, dtree_tuned2, xgb_tuned1, xgb_tuned2]
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
    j = get_metrics_score(model, X_train, X_test, y_train, y_test, False)
    acc_train.append(j[0])
    acc_test.append(j[1])
    recall_train.append(j[2])
    recall_test.append(j[3])
    precision_train.append(j[4])
    precision_test.append(j[5])
comparison_frame = pd.DataFrame({'Model':['Logistic Regression', 'Logistic Regression on Undersampled data',
'Logistic Regression on Oversampled data',
'Logistic Regression-Regularized (Oversampled data)',
'Decision tree with grid search',
'Decision tree with random search',
'XGBoost with grid search',
'XGBoost with random search',
],
'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
'Train_Recall':recall_train,'Test_Recall':recall_test,
'Train_Precision':precision_train,'Test_Precision':precision_test})
comparison_frame
feature_names = X_train.columns
importances = xgb_tuned1[1].feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
- hold a lower number of products
- stay inactive for a longer period
- have a lower revolving balance
- spend less on their credit card
- make fewer transactions with their credit card
- have a bigger drop in transaction count between Q1 and Q4
- have a lower average utilization ratio
- have more contacts with the bank
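As a robustness check on the importance ranking above: impurity-based importances from tree ensembles can be biased toward high-cardinality features, and permutation importance is a common cross-check. A sketch on synthetic data (not the notebook's dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic problem: only 3 of the 6 features are informative
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=1)
model = RandomForestClassifier(random_state=1).fit(X, y)

# Shuffle each feature in turn and measure the drop in recall
result = permutation_importance(model, X, y, n_repeats=5, random_state=1, scoring="recall")
order = np.argsort(result.importances_mean)[::-1]
print("features ranked by permutation importance:", order)
```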